A simple criterion is presented for a practical construction of generalized moments that allow one to approach the theoretical Rao-Cramer limit for parameter estimation while avoiding the complexity of the maximum likelihood method in the cases of complicated probability distributions and/or very large event samples.



Introduction. The purpose of this note is to describe a 
result that was discovered in a rather special context of the 
theory of so-called jet finding algorithms [1] but seems to be 
basic enough to belong to the core statistical wisdom of pa- 
rameter estimation. 

Namely, I would like to present a simple formula (Eq. (20)) 
that connects the method of generalized moments with the 
maximum likelihood method by explicitly describing devia- 
tions from the Rao-Cramer limit on precision of parameter es- 
timation with a given event sample; see e.g. [2], [3]. 

The formula leads to practical prescriptions (the method of quasi-optimal moments^a; see after Eq. (24)) that offer a practical alternative to the maximum likelihood method in precision measurement problems when the use of the maximum likelihood method is impractical due to the complexity of theoretical expressions for the probability distribution or the large size of the sample of events.

Although closely related to the well-known results and 
mathematical techniques, the prescription is new to the extent 
that I've seen no trace in the literature of its being known to 
physicists despite its immediate relevance to precision meas- 
urements. 

The problem. One deals with a random variable P whose instances (specific values) are called events. Their probability density is denoted as $\pi(P)$. It is assumed to depend on a parameter M which has to be estimated from an experimental sample of events $\{P_i\}_i$.

The standard method of generalized moments consists in choosing a function $f(P)$ defined on events (the generalized moment), and then finding M by fitting its theoretical average value,

$$\langle f \rangle = \int dP\, \pi(P)\, f(P), \tag{1}$$

against the corresponding experimental value:

$$\langle f \rangle_{\rm exp} = \frac{1}{N} \sum_i f(P_i), \tag{2}$$

where N is the number of events in the sample. The problem is to find $f$ which would allow one to extract M with the highest precision from the event sample.

^a In the quantum-theoretic context of [1] generalized moments are naturally interpreted as quantum observables, so the method was called the method of quasi-optimal observables.
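The fit of Eq. (1) against Eq. (2) can be sketched in Python as follows. This is only a minimal illustration under assumptions that are not in the text: a unit-width Gaussian density with mean M, the trial moment f(P) = P, midpoint-rule integration, and a bisection search. All function names are hypothetical.

```python
import math
import random

def theory_mean(f, density, M, lo, hi, n=2000):
    """Eq. (1): <f> = integral of density(P, M) * f(P) dP (midpoint rule)."""
    h = (hi - lo) / n
    return sum(density(lo + (k + 0.5) * h, M) * f(lo + (k + 0.5) * h)
               for k in range(n)) * h

def experimental_mean(f, events):
    """Eq. (2): <f>_exp = (1/N) * sum_i f(P_i)."""
    return sum(f(P) for P in events) / len(events)

def fit_parameter(f, density, events, m_lo, m_hi, lo, hi, tol=1e-6):
    """Solve <f>(M) = <f>_exp by bisection; assumes a sign change on [m_lo, m_hi]."""
    target = experimental_mean(f, events)
    g_lo = theory_mean(f, density, m_lo, lo, hi) - target
    while m_hi - m_lo > tol:
        mid = 0.5 * (m_lo + m_hi)
        g_mid = theory_mean(f, density, mid, lo, hi) - target
        if (g_mid > 0) == (g_lo > 0):
            m_lo = mid          # root lies in the upper half
        else:
            m_hi = mid          # root lies in the lower half
    return 0.5 * (m_lo + m_hi)

# Illustrative assumption: Gaussian density of unit width, true M = 0.3.
gauss = lambda P, M: math.exp(-0.5 * (P - M) ** 2) / math.sqrt(2.0 * math.pi)
rng = random.Random(0)
events = [rng.gauss(0.3, 1.0) for _ in range(1000)]
M_fit = fit_parameter(lambda P: P, gauss, events, -2.0, 2.0, -10.0, 10.0)
```

With f(P) = P the theoretical average equals M for this density, so the fitted value coincides with the sample mean up to the numerical tolerance.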



Optimal moments. In the context of precision measurements one can assume the magnitude of errors to be small. Then fluctuations in the values of M are related to fluctuations in the values of $\langle f \rangle$ as

$$\delta M = \left( \frac{\partial \langle f \rangle}{\partial M} \right)^{-1} \delta \langle f \rangle. \tag{3}$$

The derivative is applied only to the probability distribution:

$$\frac{\partial \langle f \rangle}{\partial M} = \int dP\, f(P)\, \frac{\partial \pi(P)}{\partial M}. \tag{4}$$

This is because M is unknown, so even though the solution, $f_{\rm opt}$, will depend on M, any such dependence is coincidental and therefore "frozen" in this calculation.



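For small fluctuations $\delta \langle f \rangle = N^{-1/2} \sqrt{\operatorname{Var} f}$, where

$$\operatorname{Var} f = \int dP\, \pi(P)\, \big( f(P) - \langle f \rangle \big)^2 \equiv \langle f^2 \rangle - \langle f \rangle^2. \tag{5}$$

In terms of variances, Eq. (3) becomes:

$$\operatorname{Var} M[f] = \left( \frac{\partial \langle f \rangle}{\partial M} \right)^{-2} \operatorname{Var} f. \tag{6}$$

Here $\operatorname{Var} M[f]$ is normalized per event; the overall factor $N^{-1}$ implied by Eq. (3) is left implicit. The problem is to minimize this by a suitable choice of $f$.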
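To illustrate Eqs. (4)-(6), the following sketch evaluates $\operatorname{Var} M[f]$ numerically for two trial moments. The unit-width Gaussian density with mean M, the integration range and the helper names are assumptions introduced here for illustration only.

```python
import math

def gauss(P, M, sigma=1.0):
    """Assumed density: Gaussian with mean M and fixed width sigma."""
    return math.exp(-0.5 * ((P - M) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, lo, hi, n=4000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return sum(g(lo + (k + 0.5) * h) for k in range(n)) * h

def var_M(f, density, M, lo=-12.0, hi=12.0, eps=1e-4):
    """Eq. (6): Var M[f] = Var f / (d<f>/dM)^2, per event."""
    mean_f = integrate(lambda P: density(P, M) * f(P), lo, hi)
    var_f = integrate(lambda P: density(P, M) * (f(P) - mean_f) ** 2, lo, hi)
    # Eq. (4): the derivative acts on the density only (central difference in M).
    dmean = integrate(
        lambda P: f(P) * (density(P, M + eps) - density(P, M - eps)) / (2.0 * eps),
        lo, hi)
    return var_f / dmean ** 2

var_linear = var_M(lambda P: P, gauss, 0.0)        # moment f(P) = P
var_cubic = var_M(lambda P: P ** 3, gauss, 0.0)    # moment f(P) = P^3
```

Under these assumptions f(P) = P gives Var M = 1, the best attainable value for this density, whereas f(P) = P^3 gives 15/9, so the choice of moment visibly affects the precision.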

A necessary condition for a minimum can be written in terms of functional derivatives:^b

$$\frac{\delta}{\delta f(P)} \operatorname{Var} M[f] = 0. \tag{7}$$

Substitute Eq. (6) into (7) and use the following relations:



^b An interesting mathematical exercise of casting the following reasoning (the functional derivatives, etc.) into a rigorous form is left to interested mathematical parties. A premature emphasis on rigor would have obscured the simple analogy with the study of minima of ordinary functions via the usual Taylor expansion.

For practical purposes it is sufficient to remember that the range of validity of the prescriptions we obtain is practically the same as for the maximum likelihood method. Note that the derivation in terms of functional derivatives can be related to the proofs of the Rao-Cramer inequality in terms of Hilbert statistics, etc.; cf. e.g. [4].
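$$\frac{\delta}{\delta f(P)} \langle f \rangle = \pi(P), \qquad \frac{\delta}{\delta f(P)} \langle f^2 \rangle = 2 f(P)\, \pi(P), \qquad \frac{\delta}{\delta f(P)} \frac{\partial \langle f \rangle}{\partial M} = \frac{\partial \pi(P)}{\partial M}. \tag{8}$$

After some simple algebra one obtains:

$$f(P) = \langle f \rangle + \mathrm{const} \times \frac{\partial \ln \pi(P)}{\partial M}, \tag{9}$$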



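where the constant is independent of P. The constant plays no role since $f$ is defined by this reasoning only up to a constant factor. Noticing that

$$\int dP\, \pi(P)\, \frac{\partial \ln \pi(P)}{\partial M} = \frac{\partial}{\partial M} \int dP\, \pi(P) = \frac{\partial}{\partial M}\, 1 = 0, \tag{10}$$

we arrive at the following general family of solutions:

$$f_{\rm opt}(P) = C_1 \frac{\partial \ln \pi(P)}{\partial M} + C_2, \tag{11}$$

where $C_i$ are independent of P but may depend on M.

For convenience of formal investigation we will usually deal with the following member of the family (11):

$$f_{\rm opt}(P) = \frac{\partial \ln \pi(P)}{\partial M}. \tag{12}$$

Then Eq. (10) is essentially the same as

$$\langle f_{\rm opt} \rangle = 0. \tag{13}$$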

A SIMPLE EXAMPLE. Consider the familiar Breit-Wigner shape. Let P be random real numbers distributed according to

$$\pi(P) \propto \frac{1}{(M-P)^2 + \Gamma^2} \tag{14}$$

in some fixed interval around P = M. Suppose M is unknown. Then the optimal moment is

$$f_{M,\rm opt}(P) = \frac{M-P}{(M-P)^2 + \Gamma^2}. \tag{15}$$

(Remember that P-independent additive and multiplicative constants can be dropped in such expressions; see Eq. (11).)

It is interesting to observe how $f_{M,\rm opt}$ emphasizes contributions of the slopes of the bump, exactly where the magnitude of $\pi(P)$ is most sensitive to variations of M, and taking contributions from the two slopes with a different sign maximizes the signal. At the same time the expression (15) suppresses contributions from the middle part of the bump (14) that generates mostly noise as far as M is concerned.
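A short numerical check of Eqs. (12), (13) and (15) for the Breit-Wigner shape may be useful here. The sketch below is illustrative only: the width, the fixed interval [A, B], the value of M and all function names are assumptions, and the derivative in M is taken by a central difference rather than analytically.

```python
import math

GAMMA, A, B = 1.0, -3.0, 5.0     # assumed width and fixed event interval [A, B]

def log_density(P, M):
    """ln pi(P) for the shape (14), normalized on [A, B]."""
    norm = (math.atan((B - M) / GAMMA) - math.atan((A - M) / GAMMA)) / GAMMA
    return -math.log((M - P) ** 2 + GAMMA ** 2) - math.log(norm)

def f_opt(P, M, eps=1e-5):
    """Eq. (12) by a central difference in M (normalization term included)."""
    return (log_density(P, M + eps) - log_density(P, M - eps)) / (2.0 * eps)

def f_shape(P, M):
    """Eq. (15): P-dependent part only, constants dropped."""
    return (M - P) / ((M - P) ** 2 + GAMMA ** 2)

def average(g, M, n=20000):
    """Theoretical average, Eq. (1), by the midpoint rule on [A, B]."""
    h = (B - A) / n
    total = 0.0
    for k in range(n):
        P = A + (k + 0.5) * h
        total += math.exp(log_density(P, M)) * g(P)
    return total * h

M_TRUE = 0.5
mean_f_opt = average(lambda P: f_opt(P, M_TRUE), M_TRUE)   # should vanish, Eq. (13)
```

In this setting `f_opt` differs from `f_shape` only by a multiplicative factor of -2 and an additive P-independent constant, which is the freedom expressed by Eq. (11).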

Connection with maximum likelihood. Eq. (12) can be regarded as a translation of the method of maximum likelihood (which is known to yield the theoretically best estimate for M; cf. the Rao-Cramer inequality [2], [3]) into the language of generalized moments.^c Indeed, the maximum likelihood method prescribes to estimate M by the value which maximizes the likelihood function or, equivalently, its logarithm,

$$\sum_i \ln \pi(P_i), \tag{16}$$

where summation runs over all events from the sample. The necessary condition for the maximum of (16) is

$$\frac{\partial}{\partial M} \Big[ \sum_i \ln \pi(P_i) \Big] = \sum_i f_{\rm opt}(P_i) \propto \langle f_{\rm opt} \rangle_{\rm exp} = 0. \tag{17}$$

This agrees with (12) thanks to (13).
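As a worked illustration of Eqs. (12), (13) and (17), consider an exponential density with mean M. The choice of this distribution is an assumption made here for simplicity and does not appear in the text.

```latex
% Illustrative assumption: \pi(P) = M^{-1} e^{-P/M} on P \ge 0 (not from the text).
\begin{align*}
\ln \pi(P) &= -\ln M - \frac{P}{M}, \\
f_{\rm opt}(P) &= \frac{\partial \ln \pi(P)}{\partial M} = \frac{P - M}{M^2}, \\
\langle f_{\rm opt} \rangle &= \frac{\langle P \rangle - M}{M^2} = 0
  \quad \text{since } \langle P \rangle = M, \\
\sum_i f_{\rm opt}(P_i) = 0 \;&\Longrightarrow\; M = \frac{1}{N} \sum_i P_i .
\end{align*}
```

In this case the maximum likelihood estimate and the generalized-moment fit with $f_{\rm opt}$ both reduce to the sample mean.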



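Deviations from $f_{\rm opt}$. Next we are going to consider how small deviations from $f_{\rm opt}$ affect the precision of extracted M. Consider (6) as a functional of $f$, $\operatorname{Var} M[f]$. Assume $\varphi$ is a function of events such that $\langle \varphi^2 \rangle < \infty$. We are going to evaluate the functional Taylor expansion of $\operatorname{Var} M[f_{\rm opt} + \varphi]$ with respect to $\varphi$ through quadratic terms:

$$\operatorname{Var} M[f_{\rm opt} + \varphi] = \operatorname{Var} M[f_{\rm opt}] + \frac{1}{2} \int \left[ \frac{\delta^2 \operatorname{Var} M[f]}{\delta f(P)\, \delta f(Q)} \right]_{f = f_{\rm opt}} \varphi(P)\, \varphi(Q)\, dP\, dQ + \ldots \tag{18}$$

The term which is linear in $\varphi$ does not occur because $f_{\rm opt}$ satisfies (7).

To evaluate the quadratic term in (18), it is sufficient to use functional derivatives and relations such as (8) and

$$\frac{\delta}{\delta f(P)} f(Q) = \delta(P,Q), \qquad \int \delta(P,Q)\, \varphi(P)\, dP = \varphi(Q). \tag{19}$$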



A straightforward calculation yields our main technical result:

$$\operatorname{Var} M[f_{\rm opt} + \varphi] = \frac{1}{\langle f_{\rm opt}^2 \rangle} + \frac{1}{\langle f_{\rm opt}^2 \rangle^3} \Big\{ \langle f_{\rm opt}^2 \rangle \langle \bar\varphi^2 \rangle - \langle f_{\rm opt} \bar\varphi \rangle^2 \Big\} + \ldots \tag{20}$$

where $\bar\varphi = \varphi - \langle \varphi \rangle$.

Non-negativity of the factor in curly braces follows from the standard Schwarz inequality.^d

The first term on the r.h.s. of (20), $\langle f_{\rm opt}^2 \rangle^{-1}$, is the absolute minimum for the variance of M as established by the Rao-Cramer inequality [2], [3]. The latter is valid for all $\varphi$ and therefore is somewhat stronger than the result (20) which we have obtained only for sufficiently small $\varphi$. However, Eq. (20) gives a simple explicit description of the deviation from optimality and so makes possible the practical prescriptions presented below after Eq. (24).
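The expansion (20) can be checked numerically on a toy model with a finite number of outcomes, where the integrals over P become sums. The sketch below is only an illustration: the binomial distribution with success probability M, the particular deviation $\varphi$ and all names are assumptions, not part of the derivation above.

```python
import math

def check_expansion(n=10, M=0.3, eps=0.05):
    """Compare Eq. (6) evaluated exactly with Eq. (20) through quadratic terms."""
    ks = range(n + 1)
    p = [math.comb(n, k) * M ** k * (1 - M) ** (n - k) for k in ks]
    f_opt = [(k - n * M) / (M * (1 - M)) for k in ks]     # d ln p_k / dM, Eq. (12)
    phi = [eps * (k - n * M) ** 2 for k in ks]            # arbitrary small deviation

    def avg(values):
        return sum(pk * v for pk, v in zip(p, values))

    # Exact variance of M for f = f_opt + phi, Eqs. (4)-(6) with sums for integrals.
    f = [a + b for a, b in zip(f_opt, phi)]
    mean_f = avg(f)
    var_f = avg([(v - mean_f) ** 2 for v in f])
    slope = avg([v * s for v, s in zip(f, f_opt)])        # d<f>/dM = <f * f_opt>
    exact = var_f / slope ** 2

    # Right-hand side of Eq. (20).
    info = avg([s * s for s in f_opt])
    mean_phi = avg(phi)
    phi_bar = [v - mean_phi for v in phi]
    braces = info * avg([v * v for v in phi_bar]) \
        - avg([s * v for s, v in zip(f_opt, phi_bar)]) ** 2
    approx = 1.0 / info + braces / info ** 3
    return exact, approx, 1.0 / info

exact, approx, bound = check_expansion()
```

For small `eps` the difference between `exact` and `approx` is of third order in the deviation, and both lie above `bound`, the first term of (20).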

It is convenient to talk about informativeness $I_f$ of a generalized moment $f$ with respect to the parameter M, defined by

$$I_f = \big( \operatorname{Var} M[f] \big)^{-1}. \tag{21}$$

The informativeness of $f_{\rm opt}$ is

$$I_{\rm opt} = \langle f_{\rm opt}^2 \rangle, \tag{22}$$

which corresponds to the Rao-Cramer limit. And the expansion (20) explicitly describes the deviations from the limit.

Informativeness is closely related to Fisher's information [2], [3] which, however, is an attribute of data whereas informativeness is a property of the moment.

^c Rather surprisingly, none of a dozen or so textbooks and monographs on mathematical statistics that I checked (including a comprehensive practical guide [2] and a comprehensive mathematical treatment [3]) explicitly formulated the prescription in terms of the method of moments although equivalent formulas do occur e.g. in simple examples of specific estimates for the parameters of standard distributions; cf. [4].

^d Note that the Schwarz inequality figures in standard rigorous proofs of the Rao-Cramer theorem.
The method of quasi-optimal moments. The fact that the solution (12) is the point of a quadratic minimum means that any moment $f_{\rm quasi}$ which is close to (12) would be practically as good as the optimal solution (we will call such moments quasi-optimal). A quantitative measure of closeness is given by comparing the $O(1)$ and $O(\varphi^2)$ terms on the r.h.s. of (20):

$$\frac{\langle f_{\rm opt}^2 \rangle \langle \bar\varphi^2 \rangle - \langle f_{\rm opt} \bar\varphi \rangle^2}{\langle f_{\rm opt}^2 \rangle^2} \ll 1, \tag{23}$$

where $\bar\varphi = f_{\rm quasi} - \langle f_{\rm quasi} \rangle - f_{\rm opt}$.

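The subtracted term in the numerator of (23) is non-negative, so dropping it results in a sufficient condition for Eq. (23). Furthermore, $\langle f_{\rm opt} \bar\varphi \rangle$ would tend to be suppressed anyway whenever $f_{\rm quasi}$ oscillates around $f_{\rm opt}$. Assuming without loss of generality that $\langle f_{\rm quasi} \rangle = 0$, we obtain the following convenient sufficient criterion:

$$\big\langle [f_{\rm quasi} - f_{\rm opt}]^2 \big\rangle \ll \langle f_{\rm opt}^2 \rangle. \tag{24}$$

Taking into account this and Eq. (20) and denoting the usual $\sigma$ for M for the optimal and quasi-optimal cases as $\sigma_{\rm opt}$ and $\sigma_{\rm quasi}$, respectively, one obtains:

$$\frac{\sigma_{\rm quasi}}{\sigma_{\rm opt}} \approx 1 + \frac{1}{2}\, \frac{\big\langle [f_{\rm quasi} - f_{\rm opt}]^2 \big\rangle}{\langle f_{\rm opt}^2 \rangle}. \tag{25}$$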



Now the method of quasi-optimal moments is as follows:

(i) construct a generalized moment $f_{\rm quasi}$ using (12) as a guide so that $f_{\rm quasi}$ is close to $f_{\rm opt}$ in the integral sense of Eq. (24);

(ii) find M by fitting $\langle f_{\rm quasi} \rangle$ against $\langle f_{\rm quasi} \rangle_{\rm exp}$;

(iii) estimate the error for M via (6);

(iv) $f_{\rm quasi}$ may depend on M; to find M in that case one can optionally use an iterative procedure starting from some value $M_0$ close to the true one.
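Steps (i)-(iv) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the Breit-Wigner shape (14) with known width on a fixed interval, events generated by inverse-transform sampling, midpoint-rule integration, a bisection fit, and three iterations. All names, the sample size and the starting value are hypothetical.

```python
import math
import random

GAMMA, A, B = 1.0, -5.0, 5.0     # assumed width and fixed event interval [A, B]

def density(P, M):
    """Shape (14), normalized on [A, B]."""
    norm = (math.atan((B - M) / GAMMA) - math.atan((A - M) / GAMMA)) / GAMMA
    return 1.0 / (norm * ((M - P) ** 2 + GAMMA ** 2))

def f_quasi(P, M0):
    """Step (i): moment (15) frozen at M0, constants dropped."""
    return (M0 - P) / ((M0 - P) ** 2 + GAMMA ** 2)

def average(g, M, n=2000):
    """Theoretical average, Eq. (1), by the midpoint rule on [A, B]."""
    h = (B - A) / n
    return sum(density(A + (k + 0.5) * h, M) * g(A + (k + 0.5) * h)
               for k in range(n)) * h

def fit_once(events, M0, half_width=1.0, tol=1e-6):
    """Step (ii): solve <f_quasi>(M) = <f_quasi>_exp by bisection around M0."""
    f = lambda P: f_quasi(P, M0)
    target = sum(f(P) for P in events) / len(events)
    lo, hi = M0 - half_width, M0 + half_width
    g_lo = average(f, lo) - target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (average(f, mid) - target > 0) == (g_lo > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sigma_M(M, n_events, eps=1e-4):
    """Step (iii): Eq. (6) with f frozen at M, scaled by the sample size."""
    f = lambda P: f_quasi(P, M)
    mean_f = average(f, M)
    var_f = average(lambda P: (f(P) - mean_f) ** 2, M)
    dmean = (average(f, M + eps) - average(f, M - eps)) / (2.0 * eps)
    return math.sqrt(var_f / dmean ** 2 / n_events)

def sample(M, n, rng):
    """Inverse-transform sampling of the truncated shape (14)."""
    u_lo, u_hi = math.atan((A - M) / GAMMA), math.atan((B - M) / GAMMA)
    return [M + GAMMA * math.tan(rng.uniform(u_lo, u_hi)) for _ in range(n)]

rng = random.Random(1)
events = sample(0.3, 20000, rng)     # assumed true value M = 0.3
M_est = 0.0                          # step (iv): starting value M0
for _ in range(3):
    M_est = fit_once(events, M_est)
sigma = sigma_M(M_est, len(events))
```

Bisection is used only because it is simple and robust near $M_0$; any root finder would serve, and the fixed number of iterations could be replaced by a convergence test.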

For practical construction of quasi-optimal moments $f_{\rm quasi}$ it is useful to reformulate (24) in terms of integrands. The explicit form for (24) is

$$\int dP\, \pi(P)\, [f_{\rm quasi}(P) - f_{\rm opt}(P)]^2 \ll \int dP\, \pi(P)\, f_{\rm opt}^2(P). \tag{26}$$

As a rule of thumb, one would aim to minimize the bracketed expression on the l.h.s. of (26):

$$[f_{\rm quasi}(P) - f_{\rm opt}(P)]^2 \ll f_{\rm opt}^2(P). \tag{27}$$

This should hold for "most" P, i.e. taking into account the magnitude of $\pi(P)$: the inequality (27) may be relaxed in the regions which yield small contributions to the integral on the l.h.s. of (26).

The example (14). Suppose the exact probability distribution differs from (14) by, say, a mild but complicated dependence of $\Gamma$ on P (as seen e.g. from some sort of perturbative calculations of theoretical corrections, a situation typical of high-energy physics problems [5]). Then the r.h.s. of (15) with a constant $\Gamma$ would correspond to a generalized moment which is only quasi-optimal but deviations from optimality may be practically negligible (depending on the "mildness" of the P-dependence). So one could still use the moment given by the simplest formula (15) without significant loss of informativeness.

Alternatively, one could replace the analytical shape (15) by cruder piecewise constant or, better, piecewise linear approximations that would imitate the expression (15):



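[Figure (28), four panels plotted against P: (a) the distribution $\pi(P)$, peaked at P = M; (b) the optimal moment $f_{\rm opt}(P)$; (c) a piecewise constant $f_{\rm quasi}(P)$; (d) a piecewise linear $f_{\rm quasi}(P)$.]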



In either case, the effect of non-optimality can be easily estimated via Eq. (25): the piecewise linear shape (d) deviates from optimality in the sense of (25) by a few per cent (in infinite domains, the slowly decreasing tails of the probability distribution may spoil this conclusion somewhat so one may wish to extend $f_{\rm quasi}$ by additional linear pieces as well as insert flat linear pieces at the sharp peaks).
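The size of the effect can be estimated with a short calculation along the following lines. The sketch is illustrative only: the symmetric interval of half-length 5 (in units of the width), the node positions of the piecewise linear shape and all names are assumptions and need not match the shape drawn in (28).

```python
import math

GAMMA, HALF = 1.0, 5.0            # assumed width and half-length of the interval around M
NODES = [-HALF, -GAMMA, GAMMA, HALF]

def pi_x(x):
    """Shape (14) in the variable x = P - M, normalized on [-HALF, HALF]."""
    norm = 2.0 * math.atan(HALF / GAMMA) / GAMMA
    return 1.0 / (norm * (x * x + GAMMA ** 2))

def f_opt(x):
    """Eq. (12) for this shape; equal to (15) up to a constant factor."""
    return 2.0 * x / (x * x + GAMMA ** 2)

def f_quasi(x):
    """Piecewise linear interpolation of f_opt through NODES."""
    for left, right in zip(NODES, NODES[1:]):
        if left <= x <= right:
            t = (x - left) / (right - left)
            return (1.0 - t) * f_opt(left) + t * f_opt(right)
    raise ValueError("x outside the interval")

def avg(g, n=20000):
    """Average over pi_x by the midpoint rule."""
    h = 2.0 * HALF / n
    return sum(pi_x(-HALF + (k + 0.5) * h) * g(-HALF + (k + 0.5) * h)
               for k in range(n)) * h

info_opt = avg(lambda x: f_opt(x) ** 2)                         # Eq. (22)
deviation = avg(lambda x: (f_quasi(x) - f_opt(x)) ** 2)         # l.h.s. of Eq. (26)
ratio_estimate = 1.0 + 0.5 * deviation / info_opt               # Eq. (25)

mean_q = avg(f_quasi)
var_q = avg(lambda x: (f_quasi(x) - mean_q) ** 2)
slope_q = avg(lambda x: f_quasi(x) * f_opt(x))                  # Eq. (4): <f_quasi * f_opt>
ratio_exact = math.sqrt(var_q * info_opt) / abs(slope_q)        # from Eq. (6)
```

Here `ratio_estimate` follows Eq. (25), while `ratio_exact` compares the variances from Eq. (6) directly; with only three linear pieces both come out within a few per cent of unity.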

Discussion. Eq. (27) allows one to talk about non-optimality of moments (i.e. their lower informativeness compared with $f_{\rm opt}$) in terms of sources of non-optimality, i.e. the deviations of $f_{\rm quasi}(P)$ from $f_{\rm opt}(P)$ which give sizeable contributions to the l.h.s. of (24). The simplest example is when $f_{\rm opt}$ is a continuous smoothly varying function whereas $f_{\rm quasi}$ is a piecewise constant approximation (see (28), figure (c)). Then $f_{\rm quasi}$ would usually deviate most from $f_{\rm opt}$ near the discontinuities which, therefore, are naturally identified as sources of non-optimality. Then a natural way to improve $f_{\rm quasi}$ is by "regulating" discontinuities via continuous (e.g. linear) interpolations.

Intuitively, one could think about sources of non-optimality as "leaks" through which information about M is lost, and the improvement of $f_{\rm quasi}$ would then correspond to patching up those leaks.

It is practically sufficient to take Eq. (12) at some value $M = M_0$ close to the true one (which is unknown anyway). This is usually possible in the case of precision measurements. One could also perform an iterative procedure for M starting from $M_0$, then replacing $M_0$ with the value newly found, etc.; this procedure is closely related to the optimization in the maximum likelihood method.

If $\pi(P)$ is given by a perturbation theory with increasingly complex but decreasingly important contributions, it is possible to use an approximate shape for the r.h.s. of (12) such as given by a few terms of a perturbative expansion in which the dependence on the parameter manifests itself. Theoretical updates of the complete $\pi(P)$ need not be always reflected in the quasi-optimal moments.

If the dimensionality of the space of events is not large then it may be possible to construct a suitable $f_{\rm quasi}$ in a brute force fashion, i.e. build an interpolation formula for $\pi(P)$ for two or more values of M near the value of interest, and perform the differentiation in M numerically.
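The brute-force construction might look as follows for one-dimensional events. This is a sketch under assumptions not made in the text: the density is tabulated on a regular grid in P at the two values $M_0 \pm h$, interpolated linearly in P, and differentiated in M by a central difference. The Gaussian test density and all names are hypothetical.

```python
import bisect
import math

def interpolate(grid, values, P):
    """Piecewise linear interpolation in P on a sorted grid."""
    k = min(max(bisect.bisect_right(grid, P) - 1, 0), len(grid) - 2)
    t = (P - grid[k]) / (grid[k + 1] - grid[k])
    return (1.0 - t) * values[k] + t * values[k + 1]

def make_f_quasi(density, M0, h, grid):
    """Tabulate density(P, M0 +- h) once and return the moment f_quasi(P)."""
    plus = [density(P, M0 + h) for P in grid]
    minus = [density(P, M0 - h) for P in grid]

    def f_quasi(P):
        # Central difference of ln(pi) in M, Eq. (12), on the interpolated tables.
        return (math.log(interpolate(grid, plus, P))
                - math.log(interpolate(grid, minus, P))) / (2.0 * h)

    return f_quasi

# Assumed test case: unit-width Gaussian, for which Eq. (12) gives f_opt(P) = P - M0.
gauss = lambda P, M: math.exp(-0.5 * (P - M) ** 2) / math.sqrt(2.0 * math.pi)
grid = [-4.0 + 0.05 * k for k in range(161)]
f_quasi = make_f_quasi(gauss, M0=0.0, h=0.01, grid=grid)
```

Once the tables are built, evaluating `f_quasi` on each event costs only a table lookup, which is the point of replacing a complicated $\pi(P)$ by an interpolation formula.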

Also, one can use different expressions for $f_{\rm quasi}$: e.g. perform a few first iterations with a simple shape for faster calculations and then switch to a more sophisticated interpolation formula for best precision.
Several parameters. With several parameters to be extracted from data there are the usual ambiguities due to reparametrizations but one can always define a moment per parameter according to (12). Then the informativeness (21) is a matrix (as is Fisher's information).

Since the covariance matrix of (quasi-) optimal moments is known (or can be computed from data), the mapping of the corresponding error ellipsoids for different confidence levels from the space of moments into the space of parameters is straightforward.
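The mapping can be sketched with standard linear error propagation, which generalizes Eqs. (3) and (6) to parameters $M_a$ and moments $f_a$. The matrix names $D$, $C_f$ and $C_M$ are introduced here for illustration and are not the notation of the text; the factor $1/N$ is written explicitly.

```latex
% Assumed notation: D = response matrix, C_f = covariance of the moments,
% C_M = covariance of the fitted parameters.
\begin{align*}
D_{ab} &= \frac{\partial \langle f_a \rangle}{\partial M_b}
        = \int dP\, f_a(P)\, \frac{\partial \pi(P)}{\partial M_b}, \\
(C_f)_{ab} &= \langle f_a f_b \rangle - \langle f_a \rangle \langle f_b \rangle, \\
\delta M &= D^{-1}\, \delta \langle f \rangle
  \quad \Longrightarrow \quad
  C_M = \frac{1}{N}\, D^{-1}\, C_f\, \big( D^{-1} \big)^{T} .
\end{align*}
```

For the optimal moments $f_a = \partial \ln \pi / \partial M_a$ one has $\langle f_a \rangle = 0$ and $D = C_f$, so that $C_M = C_f^{-1} / N$, the inverse of the informativeness matrix scaled by the sample size.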

Optimal moments and the least squares method. The popular $\chi^2$ method makes a fit with a number of non-optimal moments (bins of a histogram). The histogramming implies a loss of information but the method is universal, verifies the probability distribution as a whole, and is implemented in standard software routines. On the other hand, the choice of $f_{\rm quasi}$ requires a problem-specific effort but then the loss of information can in principle be made negligible by suitable adjustments of $f_{\rm quasi}$.

The balance is, as usual, between the quality of custom solutions and the readiness of universal ones. However, once quasi-optimal moments are found, the quality of the maximum likelihood method seems to become available at a lower computational cost.

The two methods are best regarded as complementary: one could first employ the $\chi^2$ method to verify the shape of the probability distribution and obtain the value of $M_0$ to be used as a starting point in the method of quasi-optimal moments in order to obtain the best final estimate for M.

An additional advantage of the method of quasi-optimal moments may be that some of the more sophisticated theoretical formalisms yield predictions for probability densities in the form of singular (and therefore not necessarily positive-definite everywhere) generalized functions (cf. the systematic gauge-invariant quantum-field-theoretic perturbation theory with unstable particles outlined in [6]). In such cases theoretical predictions for generalized moments (quasi-optimal or not) may exceed in quality predictions for probability densities, so that the use of the $\chi^2$ method would be somewhat disfavored compared with the method of quasi-optimal moments for the highest-precision measurements of unknown parameters.

Note that the data processing for the LEP1 experiments [5] has been performed in several iterations over several years and it would have been entirely possible to design, say, five quasi-optimal moments for the five parameters measured at the Z resonance back in the '80s and to use them ever since.

Conclusions. It is clear that the method of quasi-optimal moments may be a useful addition to the data-processing arsenal e.g. in situations encountered in precision measurement problems in high-energy particle physics (cf. [5])
where one deals with 0(10*) events and very complicated 
probability distributions obtained via quantum-field-theoretic perturbation theory so that the optimization involved in the maximum likelihood method is unfeasible. It also does not seem impossible to design universal software routines for a numerical construction of $f_{\rm quasi}$ in the form of dynamically generated interpolation formulas.

Lastly, the usefulness of the concept of quasi-optimal moments is not limited to purely numerical situations: it also proved to be useful in the theoretical context of [1] as a guiding principle for studying an important class of data processing algorithms (the so-called jet finding algorithms).